If you’re reading this for the first time, I recommend reading the Objective and looking at the resulting figures at the end of the document to get an idea of what I’m doing before digging too deeply into the methods. That said, DO dig into the methods afterwards.



Objective

Cluster high-flow event characteristics and antecedent watershed conditions to evaluate how these factors converge as flux regimes (clusters) to produce variability in:
1. Event NO3 yields
2. Event SRP yields
3. Event turbidity yields
4. Event NO3:SRP yield ratios


Select variables to keep

Per K Underwood: if two variables are strongly correlated (negatively or positively) they can effectively “double-weight” a particular factor important in driving clustering; thus, keep just one of the variables to serve as a proxy for that factor
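As a rough illustration of the screening step described above, the sketch below flags every pair of variables whose absolute Pearson correlation exceeds a cutoff. It is written in Python and is not the code used for this analysis; the helper names (`pearson`, `correlated_pairs`) and the toy values are assumptions made up for the example.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlated_pairs(data, cutoff=0.70):
    """Return (var_a, var_b, r) for every pair with |r| above the cutoff."""
    pairs = []
    for a, b in combinations(data, 2):
        r = pearson(data[a], data[b])
        if abs(r) > cutoff:
            pairs.append((a, b, round(r, 3)))
    # Strongest relationships first, regardless of sign
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Made-up values for illustration only (not from the Wade or Hungerford data)
toy = {
    "q_1d": [0.2, 0.5, 0.9, 1.4, 2.0],
    "q_4d": [0.3, 0.6, 1.0, 1.3, 2.2],
    "DOY": [250, 200, 150, 120, 90],
    "solarRad_4d": [120, 310, 150, 280, 200],
}
flagged = correlated_pairs(toy)
for a, b, r in flagged:
    print(f"{a} vs {b}: r = {r}")
```

Using the absolute value means negative correlations are flagged too, which matches the "negatively or positively" wording above. Which variable of a flagged pair to keep remains a judgment call.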

Decisions for eliminating variables w/ correlations >70%

These were used for the 2020-12-10 run:

  • These decisions were tough to make and need review
  • Trying to find a VWC variable that correlates well with GW level so that I can remove GW level vars (no GW data in 2017)
    • gw_4d_allWells and VWC_pre_wet_30cm are the most highly correlated pair (84%), though they were better correlated at Hungerford (94.5%). VWC_pre_wet_15cm and gw_4d_allWells are slightly less correlated (80.2%), but were also much better correlated at Hungerford (94.2%). At Hungerford I kept only VWC_pre_wet_15cm as the GW proxy, even though the 30cm equivalent was slightly better correlated, so that I could keep all the soil vars (e.g., temp, DO, redox). I’m going to take the same approach at Wade (keep only VWC_pre_wet_15cm for now)
    • SoilTemp_pre_wet_15cm and VWC_pre_wet_15cm are correlated (73.7%), but they’re pretty different and close enough to 70% correlation cutoff; will leave both for now and could try versions w/ one or the other to see if it matters
    • q_event_max and q_mm are highly correlated (74.3%), but less than at Hungerford (83.8%); let’s drop q_event_max for now (reason still to be determined)
    • DOY and NO3_1d (mean stream NO3 conc 1 day prior to event start) are negatively correlated (-73.7%), keeping both for now, but should I?

    • rain_event_total_mm and API_4d are correlated (71.9%), leaving both for now, should I?

    • q_4d and DO_pre_wet_15cm are negatively correlated (-71.5%), but they’re pretty different and close enough to 70% correlation cutoff; will leave both for now and could try versions w/ one or the other to see if it matters
    • rain_event_total_mm and turb_event_max are correlated (70.4%); but the relationship isn’t great and it’s right at my arbitrary 70% correlation cutoff so leaving both for now

  • I feel OK about these decisions, but they should be reviewed as well
  • If the 1-d and 4-d values for a variable are highly correlated, use the 4-d value
    • gw_1d_allWells and gw_4d_allWells are highly correlated (90.9%); remove gw_1d_allWells
  • MET variables
    • airT_1d and airT_4d are highly correlated (92.7%); removing airT_1d
    • airT_4d and dewPoint_4d are highly correlated (96.3%), as is dewPoint_1d (92.4%); removing both dewPoints
    • airT_4d and SoilTemp_pre_wet_15cm are highly correlated (92.7%); drop airT_4d b/c we still have diff_airT_soilT & soilT follows similar annual arc
    • solarRad_1d and solarRad_4d are highly correlated (72.7%); drop solarRad_1d per rule above
  • Q
    • q_1d and q_4d are highly correlated (91.1%); sticking with the rule above, keep q_4d
    • q_event_delta & q_event_max are highly correlated (87.3%); I kept q_event_max for Hungerford, so let’s keep it vs delta
    • Drop q_event_dQRate_cmsPerHr b/c it’s confusing and hopefully q_event_delta or rain intensity will capture this
  • Rain
    • Drop all the rain_Xd vars, b/c API_4d should cover this, though would be interesting to test how many days pre-event (e.g., 4 days for API) matters
    • Unlike at Hungerford, where they were correlated (74.4%), rain_int_mmPERmin_mean and rain_int_mmPERmin_max are less correlated at Wade (54.6%), so leaving both
  • Redox
    • If redox variables prove not to be important or they are highly correlated with another variable, we can remove them and increase n obs
  • Stream
    • turb_1d proved not to be useful in driving clusters in the SOM, so I removed it
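The rule in the list above (when the 1-d and 4-d versions of a variable are highly correlated, keep the 4-d one) can be sketched as a small Python function. This is an illustration rather than the code used here; `drop_short_window` is a hypothetical helper, and it assumes the two names differ only by `_1d` versus `_4d`.

```python
def drop_short_window(pairs, cutoff=0.70):
    """Apply the 1-d vs 4-d rule to a list of (var_a, var_b, r) tuples.

    Returns the sorted names of the 1-d variables to drop.
    """
    to_drop = set()
    for a, b, r in pairs:
        if abs(r) <= cutoff:
            continue
        # The rule only applies when both names share a stem,
        # e.g. q_1d / q_4d or gw_1d_allWells / gw_4d_allWells
        for short, long_ in ((a, b), (b, a)):
            if "_1d" in short and long_ == short.replace("_1d", "_4d"):
                to_drop.add(short)
    return sorted(to_drop)

# Correlations reported in the list above, expressed as fractions
reported = [
    ("gw_1d_allWells", "gw_4d_allWells", 0.909),
    ("airT_1d", "airT_4d", 0.927),
    ("solarRad_1d", "solarRad_4d", 0.727),
    ("q_1d", "q_4d", 0.911),
    ("airT_4d", "dewPoint_4d", 0.963),  # different stems, so the rule does not apply
]
print(drop_short_window(reported))
```

Pairs of different variables (such as airT_4d and dewPoint_4d) are deliberately left alone, since those decisions were made case by case.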


Look at correlations again after dropping variables

Self-organizing map (SOM)

Prepare data & set up grid/lattice dimensions

We’re only using complete observations/rows (no NAs in any columns)
According to the heuristic rule from Vesanto 2000, number of grid elements/grid size/nodes = 5 * sqrt(n)
To determine the shape of the grid (ratio of columns to rows), we use the ratio of the first two eigenvalues of the input data set, as recommended by Park et al. 2006

## [1] No. of complete observations: 97 out of 120 observations
## [1] No. of Vesanto nodes: 49
## [1] Ratio of columns to rows: 1.3
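The arithmetic behind these numbers can be sketched as follows. This is a Python illustration, not the hidden code. The eigenvalue ratio (1.3) is taken as given from the output above rather than recomputed. The way `grid_shape` turns a node count and a ratio into rows and columns is an assumption, and both function names are made up for the example.

```python
from math import sqrt

def vesanto_nodes(n_obs):
    """Heuristic map size: 5 * sqrt(n), rounded to the nearest whole node."""
    return round(5 * sqrt(n_obs))

def grid_shape(n_nodes, col_row_ratio):
    """Pick rows x cols so that cols/rows ~ ratio and rows*cols ~ n_nodes."""
    rows = max(1, round(sqrt(n_nodes / col_row_ratio)))
    cols = max(1, round(n_nodes / rows))
    return rows, cols

nodes = vesanto_nodes(97)            # 97 complete observations
rows, cols = grid_shape(nodes, 1.3)  # ratio of columns to rows reported above
print(nodes, rows, cols)             # 49 6 8
```

A 6 x 8 grid is only a starting point under these assumptions; the suite of runs in the next section explores a range of grid configurations and node counts.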



Run SOM for a suite of grid/lattice configurations, # of nodes, and # of clusters

Code courtesy of Kristen Underwood (hidden)

## [1] Topology: hexagonal
## [1] Data normalization method used: L2norm
## [1] Weighting method used: noPCA
## [1] No. of iterations: 1000
## [1] alphaCrs used: 0.05
## [1] alphaFin used: 1000



Choose the best SOM run based on non-parametric F-stat and quantization error

We want to maximize npF (ratio of between-cluster to within-cluster variance) and minimize QE (mean distance between each data vector and its best-matching unit)
Here are the top 33% of runs based on npF
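For reference, the two criteria can be sketched in Python as below. QE follows the definition given above. The F-style ratio is only a stand-in: it uses the classical between/within formula, (SSB / (k - 1)) / (SSW / (n - k)), and the non-parametric F-stat computed by the actual script may be defined differently. The function names and the toy data are assumptions for illustration.

```python
from math import dist

def quantization_error(data, codebook):
    """Mean Euclidean distance from each data vector to its best-matching unit."""
    return sum(min(dist(x, w) for w in codebook) for x in data) / len(data)

def pseudo_f(data, labels):
    """Between/within variance ratio: (SSB / (k - 1)) / (SSW / (n - k))."""
    n, dims = len(data), len(data[0])
    grand = [sum(x[d] for x in data) / n for d in range(dims)]
    groups = {}
    for x, lab in zip(data, labels):
        groups.setdefault(lab, []).append(x)
    k = len(groups)
    ssb = ssw = 0.0
    for members in groups.values():
        centroid = [sum(x[d] for x in members) / len(members) for d in range(dims)]
        ssb += len(members) * dist(centroid, grand) ** 2
        ssw += sum(dist(x, centroid) ** 2 for x in members)
    return (ssb / (k - 1)) / (ssw / (n - k))

# Toy example: two tight, well-separated clusters and a two-node codebook
data = [(0, 0), (0, 1), (4, 0), (4, 1)]
labels = [0, 0, 1, 1]
codebook = [(0, 0.5), (4, 0.5)]
qe = quantization_error(data, codebook)
f_ratio = pseudo_f(data, labels)
print(qe, f_ratio)
```

Well-separated clusters push the ratio up, and a codebook that sits close to the data pushes QE down, which is why the two are read together when comparing runs.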

Choose the best run for each high-performing (top 33%) number of clusters
## [1] These were the top runs for each high-performing (top 33%) cluster #:
Run  rows  cols  Nodes  Clusters     npF     QE
 72     7     9     63         6  54.742  0.128
 52     7     8     56         5  54.436  0.124
 23     5    10     50         4  50.880  0.151
 15     7     9     63         3  50.591  0.130
## [1] The script chose:

Run  rows  cols  Nodes  Clusters     npF     QE
 72     7     9     63         6  54.742  0.128
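The selection rule is not spelled out in this document. One reading that is consistent with the reported output is to rank the candidate runs by npF (highest first), using QE (lowest first) only to break ties. The Python sketch below applies that assumed rule to the four runs listed above; it is not the script's actual logic.

```python
# The four top runs reported above, one per number of clusters
runs = [
    {"run": 72, "rows": 7, "cols": 9, "nodes": 63, "clusters": 6, "npF": 54.742, "QE": 0.128},
    {"run": 52, "rows": 7, "cols": 8, "nodes": 56, "clusters": 5, "npF": 54.436, "QE": 0.124},
    {"run": 23, "rows": 5, "cols": 10, "nodes": 50, "clusters": 4, "npF": 50.880, "QE": 0.151},
    {"run": 15, "rows": 7, "cols": 9, "nodes": 63, "clusters": 3, "npF": 50.591, "QE": 0.130},
]

# Assumed rule: highest npF first, lowest QE as the tie-breaker
ranked = sorted(runs, key=lambda r: (-r["npF"], r["QE"]))
best, second = ranked[0], ranked[1]
print(best["run"], second["run"])  # 72 52
```

Under this rule, run 52 ranks second even though it has the lowest QE of the four, so a rule that weighted QE more heavily could pick a different run.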


To examine this run in greater detail (e.g., component planes), see the ‘X_SOMplots_site_ … .pdf’


Examine boxplots of independent variables by cluster





Examine how antecedent and event conditions converge to influence N & P flux regimes:













How do our results differ if we choose the 2nd best SOM run?

## [1] The 2nd best run was:

Run  rows  cols  Nodes  Clusters       npF        QE
 52     7     8     56         5  54.43617  0.123941


To examine this run in greater detail (e.g., component planes), see the ‘X_SOMplots_site_ … .pdf’